32 research outputs found

Building Subject-aligned Comparable Corpora and Mining it for Truly Parallel Sentence Pairs

Author: Marasek Krzysztof
Wołk Krzysztof
Publication venue: 'Elsevier BV'
Publication date: 29/09/2015

Parallel sentences are a relatively scarce but extremely useful resource for many applications, including cross-lingual retrieval and statistical machine translation. This research explores our methodology for mining such data from previously obtained comparable corpora. The task is highly practical since non-parallel multilingual data exist in far greater quantities than parallel corpora, but parallel sentences are a much more useful resource. Here we propose a web crawling method for building subject-aligned comparable corpora from Wikipedia articles. We also introduce a method for extracting truly parallel sentences by filtering them from noisy or merely comparable sentence pairs. We describe our implementation of a specialized tool for this task, as well as the training and adaptation of a machine translation system that supplies our filter with additional information about the similarity of comparable sentence pairs.

arXiv.org e-Print Archive

Elsevier - Publisher Connector

Shallow reading with Deep Learning: Predicting popularity of online content using only its title

Author: Marasek Krzysztof
Rokita Przemyslaw
Stokowiec Wojciech
Trzcinski Tomasz
Wołk Krzysztof
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 21/07/2017

With the ever-decreasing attention span of contemporary Internet users, the title of online content (such as a news article or video) can be a major factor in determining its popularity. To take advantage of this phenomenon, we propose a new method based on a bidirectional Long Short-Term Memory (LSTM) neural network designed to predict the popularity of online content using only its title. We evaluate the proposed architecture on two distinct datasets of news articles and news videos distributed in social media that contain over 40,000 samples in total. On those datasets, our approach improves the performance over traditional shallow approaches by a margin of 15%. Additionally, we show that using pre-trained word vectors in the embedding layer improves the results of LSTM models, especially when the training set is small. To our knowledge, this is the first attempt at predicting popularity using only textual information from the title.

arXiv.org e-Print Archive

Crossref